This document reports the details behind the generation of the E. coli feature set. It is based mainly on RegulonDB’s curated data and on a few other sources detailed below.
E. coli feature set
The following sets are generated:
1. Gene set
A table generated by merging all gene-related information from RegulonDB and Zika, indexed by a consensus bnumber, and containing exhaustive synonyms for genes and their products.
2. TSS set and Promoter set
TSSs are gathered from RegulonDB, HT datasets (Mendoza-Vargas et al., 2009; Kim et al., 2012; Cho et al., 2014; Thomason et al., 2014; Yan et al., 2018), and an unpublished collection from the Wade group. They are merged on the basis of their coordinates and strand.
Promoters are regions containing 1 or more TSSs, where each TSS is at most 4 bp away from another TSS. A promoter opens with a TSS, and expands as long as there is another TSS in the following 4 positions (using a 5-nt window, meaning the maximum “empty” space between them is 3 nt). It closes with the last TSS of the sequence.
Promoter definition
3. Transcription unit set and co-transcribed genes set
Transcription units are defined by their unique coordinates and strand. Experimental TUs are extracted from RegulonDB, HT TUs from a public dataset (Yan et al., 2018), and custom orphan TUs are made of orphan genes. They are then merged on the basis of their coordinates and strand.
Every group of co-transcribed genes (CTG) is made of genes that are co-transcribed at least once. The CTG set is derived from the TU set. TUs that contain exactly the same complete genes are grouped into a CTG set, and the widest coordinates are kept. A given gene can be in several CTG sets, but 2 CTG sets cannot contain exactly the same genes.
Both sets come with an “operon_name” column. Here, an operon is a set of adjacent genes made of “one or several mutually overlapping transcription units that are transcribed in the same direction and share at least one gene”, as proposed by Mejía-Almonte et al. (2020). This column is purely informative and may not match known operons.
TU and CTG definition
4. Binding sites set
Set of curated binding sites from RegulonDB. They are merged using their coordinates, TF bnumber and effect (+ or -). When the TF is a heterodimer, 2 entries are created: one per bnumber.
Master gene table, with exhaustive synonyms gathered from RegulonDB, Ecocyc, Zika, etc.
The following files are created:
Feature_set_2021-09-21/tss_set_2021-09-21.tsv
Feature_set_2021-09-21/promoter_set_2021-09-21.tsv
Feature_set_2021-09-21/tss_promoter_map_2021-09-21.tsv
This set is not included
PromoterSet.txt file downloaded from RegulonDB website (version 10.8).
Defined by TSSs merged using a 5-bp sliding window.
NB: TSSs without coordinates are removed.
Promoter definition
TSSs without filtering: 65409
TSSs after duplicate merging: 28987
Promoters: 23316
Promoters were built using a 5-bp sliding window to group nearby TSSs.
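The grouping rule can be illustrated with a minimal Python sketch. It is not the code used to produce the files; the function name `build_promoters`, the `(position, strand)` input format, the choice to group each strand separately, and the scan in ascending genomic coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def build_promoters(tss_list, max_gap=4):
    """Group TSSs into promoters (illustrative sketch, hypothetical names).

    tss_list: iterable of (position, strand) tuples.
    max_gap:  largest distance in bp between two consecutive TSSs of the
              same promoter (4 bp, i.e. a 5-nt window).
    Returns a list of (start, end, strand, positions) tuples.
    """
    # Assumption: TSSs are grouped per strand; identical (position, strand)
    # pairs collapse into one entry because positions are stored in a set.
    by_strand = defaultdict(set)
    for pos, strand in tss_list:
        by_strand[strand].add(pos)

    promoters = []
    for strand, positions in sorted(by_strand.items()):
        ordered = sorted(positions)
        current = [ordered[0]]  # a promoter opens with a TSS
        for pos in ordered[1:]:
            if pos - current[-1] <= max_gap:
                current.append(pos)  # still inside the window: extend
            else:
                # gap too large: close on the last TSS and open a new promoter
                promoters.append((current[0], current[-1], strand, current))
                current = [pos]
        promoters.append((current[0], current[-1], strand, current))
    return promoters


example = [(100, "+"), (103, "+"), (107, "+"), (112, "+"), (100, "-"), (100, "+")]
for promoter in build_promoters(example):
    print(promoter)
```

In this toy input, positions 100, 103 and 107 on the + strand are chained because each is at most 4 bp from the previous one, while 112 is 5 bp away from 107 and therefore opens a new promoter.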
Transcription units are defined by their unique coordinates and strand. Experimental TUs are extracted from RegulonDB, HT TUs from a public dataset (Yan et al., 2018), and custom orphan TUs are made for the remaining “orphan genes”, i.e. genes that are not entirely covered by any TU. They are then merged on the basis of their coordinates and strand.
Every group of co-transcribed genes (CTG) is made of genes that are entirely co-transcribed at least once. The CTG set is derived from the TU set. TUs that contain exactly the same complete genes are grouped into a CTG set, and the widest coordinates are kept. A given gene can be in several CTG sets, but 2 CTG sets cannot contain exactly the same genes. Every gene from Zika’s genesView is present in at least one CTG set.
Both sets come with an “operon_name” column. Here, an operon is a set of adjacent genes made of “one or several mutually overlapping transcription units that are transcribed in the same direction and share at least one gene”, as proposed by Mejía-Almonte et al. (2020). This column is purely informative and may not match known operons.
TU and CTG definition
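One possible reading of the operon definition quoted above is that TUs on the same strand that share at least one gene, directly or through a chain of other TUs, belong to the same operon. The sketch below implements that reading with a small union-find. It is an interpretation for illustration only, not the procedure used to fill the “operon_name” column; `group_operons` and the input layout are hypothetical.

```python
def group_operons(tus):
    """Group TUs into operon-like clusters (illustrative sketch).

    tus: dict mapping a TU id to (strand, set of gene ids).
    Returns a list of sets of TU ids. Two TUs end up in the same set when
    they are on the same strand and are linked by shared genes.
    """
    parent = {tu_id: tu_id for tu_id in tus}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every TU to the first TU seen with the same (strand, gene) pair.
    first_seen = {}
    for tu_id, (strand, genes) in tus.items():
        for gene in genes:
            key = (strand, gene)
            if key in first_seen:
                union(first_seen[key], tu_id)
            else:
                first_seen[key] = tu_id

    groups = {}
    for tu_id in tus:
        groups.setdefault(find(tu_id), set()).add(tu_id)
    return list(groups.values())


# Placeholder identifiers, not real annotations.
example = {
    "TU1": ("+", {"geneA", "geneB"}),
    "TU2": ("+", {"geneB", "geneC"}),
    "TU3": ("+", {"geneD"}),
    "TU4": ("-", {"geneE"}),
}
print(group_operons(example))
```

Here TU1 and TU2 are grouped because they share geneB, while TU3 and TU4 each form their own group.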
Notes:
The following files and fields are created:
Feature_set_2021-09-21/tu_set_2021-09-21.tsv
Feature_set_2021-09-21/ctg_set_2021-09-21.tsv
Feature_set_2021-09-21/ctg_tu_map_2021-09-21.tsv
Feature_set_2021-09-21/ctg_gene_map_2021-09-21.tsv
Get all TUs from RegulonDB by directly querying the database, with their associated promoter (if any), and first and last gene positions
Get terminators associated to TUs
NB: terminator objects are not mapped to TUs here; only their position is used as a TU end coordinate
Merge experimental TUs, promoters, terminator and gene coordinates
Get PacBio TUs from the Yan paper (reference)
Files provided by Victor: link
A few bnumbers have to be updated to new ones (using master gene file):
This may artificially change the number and order of genes in those TUs. For example, the TU “b1417,b1416” becomes “b4493”.
Custom IDs are created for PacBio TUs as follows: PB_GC_TUdefinition_XXX
Get valid genes contained in TUs
Get genes that are not present as valid genes in any TU from RegulonDB or PacBio and make them “orphan TUs”
Alternative IDs are created for orphan TUs as follows: orphan_XXX
TUs are merged into single entries based on valid gene content duplicates
A table is created to map TUs with CTGs
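The last two steps (merging TUs by gene content and mapping TUs to CTGs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the files listed above: `build_ctg_set`, the dictionary keys (`id`, `start`, `end`, `strand`, `genes`) and the decision to skip TUs without any complete gene are all hypothetical.

```python
def build_ctg_set(tus):
    """Collapse TUs with identical gene content into CTGs (sketch).

    tus: list of dicts with keys "id", "start", "end", "strand" and
         "genes" (the complete, valid genes contained in the TU).
    Returns a dict mapping a frozenset of genes to a CTG record holding
    the widest coordinates and the ids of the TUs that were merged.
    """
    ctgs = {}
    for tu in tus:
        key = frozenset(tu["genes"])
        if not key:
            continue  # assumption: a TU without complete genes yields no CTG
        record = ctgs.setdefault(
            key,
            {"start": tu["start"], "end": tu["end"],
             "strand": tu["strand"], "tu_ids": []},
        )
        # Keep the widest coordinates over all TUs sharing this gene content.
        record["start"] = min(record["start"], tu["start"])
        record["end"] = max(record["end"], tu["end"])
        record["tu_ids"].append(tu["id"])
    return ctgs


# Placeholder identifiers and coordinates, not real annotations.
example = [
    {"id": "TU_A", "start": 100, "end": 2000, "strand": "+", "genes": ["g1", "g2"]},
    {"id": "TU_B", "start": 80, "end": 1900, "strand": "+", "genes": ["g2", "g1"]},
    {"id": "TU_C", "start": 1200, "end": 2000, "strand": "+", "genes": ["g2"]},
]
for genes, record in build_ctg_set(example).items():
    print(sorted(genes), record)
```

Using a frozenset as the key makes two TUs with the same genes land in the same CTG regardless of gene order, and guarantees that no two CTGs contain exactly the same genes. The `tu_ids` list of each record plays the role of the CTG-to-TU map.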
Total TUs: 8511
Total TUs without duplicate coordinates: 8221
Total TUs without duplicate gene content (CTG set): 4283
Based on all collected TUs before any sort of merging.
NB: y axis is the number of TUs duplicated, x axis is the duplication factor
Based on RegulonDB version 10.8, downloaded here.
Weak-evidence sites are removed.
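The merging rule for binding sites (same coordinates, TF bnumber and effect, with one entry per bnumber for heterodimers) can be sketched as below. The function name `merge_binding_sites`, the input layout and the bnumbers in the example are placeholders chosen for illustration; this is not the code used to build the set.

```python
def merge_binding_sites(sites):
    """Merge binding sites on (start, end, TF bnumber, effect) (sketch).

    sites: iterable of dicts with keys "start", "end", "effect" ("+" or "-")
           and "tf_bnumbers" (one bnumber, or two for a heterodimer).
    Returns a sorted list of unique (start, end, bnumber, effect) tuples.
    """
    merged = set()
    for site in sites:
        # A heterodimer contributes one entry per subunit bnumber.
        for bnumber in site["tf_bnumbers"]:
            merged.add((site["start"], site["end"], bnumber, site["effect"]))
    return sorted(merged)


# Placeholder bnumbers, not real TF annotations.
example = [
    {"start": 100, "end": 120, "effect": "+", "tf_bnumbers": ["b0001", "b0002"]},
    {"start": 100, "end": 120, "effect": "+", "tf_bnumbers": ["b0001"]},
    {"start": 100, "end": 120, "effect": "-", "tf_bnumbers": ["b0001"]},
]
for entry in merge_binding_sites(example):
    print(entry)
```

In this toy input the first site yields two entries (one per subunit), the second is a duplicate of one of them and is merged, and the third is kept because its effect differs.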
The following files and fields are created:
Feature_set_2021-09-21/tfbs_set_2021-09-21.tsv
The following file is created:
Feature_set_2021-09-21/feature_map_2021-09-21.tsv